1 service degraded · 99.98% uptime · last 90 days
Last updated: May 19, 2026 · 14:32 IST · auto-refreshes every 60s
Real-time metrics
Performance, right now.
Avg Latency
820ms
↓ 3% vs yesterday
Requests / min
1,284
↑ 12% vs avg
Error Rate
0.04%
↓ Below SLO threshold
Uptime · 90d
99.98%
SLA: 99.9% ✓
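For context, the uptime figure and the SLA target can be converted into minutes of downtime over the 90-day window. The sketch below assumes uptime is measured as a share of wall-clock minutes, which this page does not state; the function name is illustrative only.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes(uptime_pct: float, window_days: int) -> float:
    """Minutes of downtime implied by an uptime percentage over a window.

    Assumption: uptime is a fraction of wall-clock minutes in the window.
    """
    total_minutes = window_days * MINUTES_PER_DAY
    return total_minutes * (100.0 - uptime_pct) / 100.0


# Downtime the 99.9% SLA would allow over 90 days.
budget = downtime_minutes(99.9, 90)
# Downtime implied by the reported 99.98% uptime.
observed = downtime_minutes(99.98, 90)

print(f"SLA budget: {budget:.1f} min, observed: {observed:.1f} min")
```

Under that assumption, 99.9% allows roughly 129.6 minutes of downtime in 90 days, while 99.98% corresponds to roughly 25.9 minutes.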
Service health · last 90 days
Each component, tracked.
Chat API Gateway
Operational
RAG / Qdrant Vector DB
Operational
LiteLLM Provider Router
Operational
Widget CDN & Streaming
Operational
Supabase / PostgreSQL
Operational
Redis Cache & Rate Limit
Degraded · investigating
Web Dashboard
Operational
Webhooks & Event Bus
Operational
Recent incidents
What's happened lately.
Investigating
Redis cache latency spike on EU cluster
May 19, 2026 · 13:48 IST
Duration: 44m · ongoing
14:30
Investigating — Redis cache latency on EU cluster is 3× normal. Cache misses falling through to PostgreSQL. No user-facing errors yet. Engineering investigating root cause.
13:48
Identified — Spike detected via Datadog alert. Affects ~12% of EU traffic. Failover to read replica engaged.
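The "cache misses falling through to PostgreSQL" behaviour mentioned in this incident is the usual cache-aside read path. The sketch below is a generic illustration, not this platform's code: a plain dict stands in for Redis, and `fake_db_lookup` is a hypothetical stand-in for a PostgreSQL query.

```python
from typing import Callable, Dict, List


class CacheAside:
    """Read-through helper: try the cache first, fall through to the database."""

    def __init__(self, load_from_db: Callable[[str], str]) -> None:
        self._cache: Dict[str, str] = {}  # stand-in for Redis (assumption)
        self._load_from_db = load_from_db
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Cache miss: the request falls through to the database,
        # then the result is stored for subsequent reads.
        self.misses += 1
        value = self._load_from_db(key)
        self._cache[key] = value
        return value


db_calls: List[str] = []


def fake_db_lookup(key: str) -> str:
    """Hypothetical database query; records each call so load is visible."""
    db_calls.append(key)
    return f"row-for-{key}"


store = CacheAside(fake_db_lookup)
store.get("user:1")  # miss, hits the database
store.get("user:1")  # hit, served from cache
store.get("user:2")  # miss, hits the database
print(store.hits, store.misses, db_calls)
```

When the cache is slow or unavailable, more reads take the miss path, which is why database load rises during an incident like this one.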
Resolved
OpenAI provider returning 429s for some workspaces
May 17, 2026 · 09:12 IST
Duration: 1h 24m
10:36
Resolved — OpenAI restored quota. All affected workspaces back to normal. Post-mortem to follow.
09:42
Update — Automatic failover to Anthropic Claude 3.5 working for 87% of affected bots. Engineering coordinating with OpenAI.
09:12
Identified — OpenAI API returning 429 rate limit errors on shared tier. Triggered LiteLLM automatic fallback routing.
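The fallback routing described in this incident can be pictured as trying providers in priority order and moving on when one is rate limited. This is a minimal sketch of that idea and does not use or reproduce LiteLLM's actual API; `RateLimited`, `complete_with_fallback`, and the fake providers are all hypothetical names.

```python
from typing import Callable, Optional, Sequence, Tuple


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response from a provider."""


Provider = Tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Return (provider_name, response) from the first provider not rate limited."""
    last_error: Optional[Exception] = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited as exc:
            # Remember the error and try the next provider in the list.
            last_error = exc
    raise RuntimeError("all providers are rate limited") from last_error


def fake_primary(prompt: str) -> str:
    """Simulates a primary provider that is currently returning 429s."""
    raise RateLimited("429 Too Many Requests")


def fake_secondary(prompt: str) -> str:
    """Simulates a healthy fallback provider."""
    return f"echo: {prompt}"


used, reply = complete_with_fallback(
    "hello", [("primary", fake_primary), ("secondary", fake_secondary)]
)
print(used, reply)
```

Only rate-limit errors trigger the fallback here; other exceptions propagate, which is one possible design choice rather than a description of the production router.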
Resolved
Knowledge base ingestion queue backlog
May 12, 2026 · 02:14 IST
Duration: 38m
02:52
Resolved — Celery worker auto-scaled. Backlog cleared. All documents now indexed.
02:14
Identified — PDF chunking queue at 4,200+ pending. Auto-scale triggered for Celery workers.
Stay informed.
Get email or webhook notifications when incidents start, update, or resolve. Never miss a status change.